Method for determining discrete quantitative structure activity relationships

ABSTRACT

Method for developing a quantitative structure activity relationship that includes obtaining a training set of chemical compounds with molecular descriptors consisting of a number of multidimensional vectors with an activity class for each of the vectors; partitioning the multidimensional vectors into groups having interdependence; transforming the descriptors such that the interdependence of the groups is lessened; estimating a probability distribution of the descriptors by assuming that a probability distribution of a product of each of the groups is approximately equal to the probability distribution of the molecular descriptors; performing the partitioning, transforming and estimating steps for each of the activity classes; and, developing a probability distribution for the activity classes.

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application is a continuation of application Ser. No.09/252,912, which is incorporated herein by reference, and whichapplication Ser. No. 09/252,912 claims priority of United Kingdomapplication no. 9803466.3, filed Feb. 19, 1998.

FIELD OF THE INVENTION

[0002] This invention relates generally to the method of determiningrelationships between the structure or properties of chemical compoundsand the biological activity of those compounds.

BACKGROUND OF THE INVENTION

[0003] The pharmaceutical and biotechnology industries are continuouslysearching for effective therapeutic or diagnostic agents. The processesfor finding effective agents includes target identification, ligandidentification, toxicology and clinical trials.

[0004] Target identification is basically the identification of aparticular biological component, namely a protein and its associationwith particular disease states or regulatory systems. A proteinidentified in a search for a chemical compound (drug) that can affect adisease or its symptoms is called a target. Proteins are large chemicalcompounds comprising a polymer chain of amino acids. The word protein isused herein to refer to any chemical compound that is involved in theregulation or control of biological systems (e.g. enzymes) and whosefunction can be interfered with by a drug.

[0005] The word disease is used herein to refer to an acquired conditionor genetic condition. A disease can alter the normal biological systemsof the body, causing an over or under abundance of chemical compounds(i.e. a “chemical imbalance”). The regulatory systems for these chemicalcompounds involve the use, by the body, of certain proteins to detectimbalances or cause the body to produce neutralizing compounds in anattempt to restore the chemical imbalance. The word body is used hereinto refer to any biological system: e.g. plant, animal or bacterial.

[0006] Ligand identification includes search for a chemical compoundthat binds to a particular target. A ligand is a chemical compound thatcan attach itself to a protein and interfere with the normal functioningof the protein. A useful analogy is viewing the protein as a “lock” andthe ligand as a “key.” A ligand that fits the “lock” is called “active.”

[0007] Toxicological and clinical trials involve characterizing theeffects on the entire body of an identified ligand for a particulartarget. Additionally, the overall effectiveness regarding the diseasemust also be measured. These efforts are conducted in model bodies (i.e.generally animals) and then ultimately on the intended body (i.e.generally humans).

[0008] The present invention relates to ligand identification. In otherwords, a target has been identified and the identity of an active ligandis desired. Ligand identification generally involves the developing of ahypothesis that a particular chemical compound will be active,performing a physical experiment to determine if the hypothesizedcompound is active, and if the compound is not active, then returning tothe step of developing a hypothesis.

[0009] There are several methods available for developing hypothesesthat a particular chemical compound will be active.

[0010] A very slow and unpredictable process is introspection. That is,the expertise gained by humans in the hypothesis-experiment process canbe put to use in developing new hypotheses regarding the selection ofcandidate ligands.

[0011] Computer simulation methods have also been proposed to reduce thecost of physical experiments. These methods include simulations ofactivity and suggestions for new candidate ligands. These simulationshave not had broad success and are generally too slow and unreliableunless a number of active compounds have already been discovered andminor modifications are desired to improve some property.

[0012] The current method of choice is generally called high throughputscreening (HTS). This includes the automation of the physical experimentstep with robots so that hundreds of thousands or millions ofexperiments can be performed in a short period of time. This process hasallowed a brute-force approach to ligand discovery. The hypothesis phaseconsists of obtaining large collections of molecules either fromexternal suppliers or through combinatorial chemistry type production oflarge numbers of compounds. Combinatorial chemistry is a methodology inwhich many chemical reactions are performed simultaneously to produce alarge collection of compounds. The large collection of compounds canthen be physically tested with robots and activity results measured.

[0013] The universe of possible ligands is extremely large; estimatedbetween 10⁴⁰ and 10⁴⁰⁰ compounds. Accordingly, even with HTS approachesit is impossible to physically test all possible ligand candidates.Thus, methods are needed to discard the majority of the possibilities inadvance or as the search proceeds.

[0014] It is generally accepted that the structure, composition, orphysical properties of a ligand directly affect its biological activityagainst a target. The attempt to transform this qualitative belief intoa quantitative method of activity assessment is known as thedetermination of Quantitative Structure Activity Relationships, or QSAR.QSAR began with the work of Hansch and was further developed by others.See, Hansch, C., Fujita, T, ρ-σ-π Analysis, A Method for the Correlationof Biological Activity and Chemical Structure, J. Am. Chem. Soc. 1964;Cramer, R. D., Patterson, D. E., Gunce, J. D., Comparative MolecularField Analysis (CoMFA), 1. Effect of Shape on Binding of Steroids toCarrier Proteins, J. Am. Chem. Soc., 1988, 110, 5959-5967; and, Roger,D., Hopfinger, A. J., Application of Genetic Function Approximation toQuantitative Structure-Activity Relationships and QuantitativeStructure-Property Relationships, J. Chem. Info. Comp. Sci., 1994, 34.

[0015] Determining a QSAR generally includes the following steps:

[0016] First, a quantitative measure of activity needs to be defined.

[0017] Second, the ligand needs to be expressed in some quantitativemanner. This step generally includes selecting a collection of numbersthat characterize the ligand. These numbers are called moleculardescriptors or descriptors.

[0018] Then, a functional relationship between activity and the selecteddescriptors must be determined. This includes developing a mathematicalfunction that has the property that “activity=a function of thedescriptors”, to a suitable high level of accuracy.

[0019] The functional relationship and the molecular descriptors aregenerally used to predict the activity of new candidate ligands.

[0020] Activity is traditionally measured as the amount of ligand neededto produce a particular interference with a target. The amount needed ison a continuous scale.

[0021] The selection of molecular descriptors is usuallytarget-specific. Physical properties are often used. Mathematicalproperties based on the line drawing of a chemical compounds are alsoused. The use of the electric field of the ligand as a moleculardescriptor is called Comparative Molecular Field Analysis (CoMFA) andhas been the subject of previous patents. Other molecular descriptorsets include “fingerprints” or holograms”, which are descriptions ofsmall sub-structures in the ligand.

[0022] The most widely used method of determining the functionalrelationship is the statistical technique of regression or leastsquares. Techniques such as genetic algorithms and partial least squaresare used to select the “important” descriptors from the “less important”descriptors or “noise”.

[0023] The use of high throughput screening (HTS) to identify activecompounds has greatly challenged commonly used QSAR techniques. HTSusually generates large amounts of assay data, which initiallyclassifies compounds as active or inactive. In addition, compounds inscreening libraries are typically noncongeneric, i.e., they do not sharesimilar core structures. This makes it difficult, if not impossible, toanalyze HTS data by classical QSAR techniques and to predict activecompounds.

[0024] Higher throughput reduces the precision of the activitymeasurement. Many HTS technologies report a binary condition; acandidate ligand is either “active” or “inactive”. Some HTS technologiesreport a discrete measure; i.e. activity on a scale of 1 to 10. Ineither case, classical QSAR techniques require a continuous activitymeasurement, e.g. accurate to two to three decimal places.

[0025] Many HTS techniques have the unfortunate property that theactivity measurement is error prone. The error rate is significantenough to warrant special attention since classical QSAR technology isvery sensitive to error and outliers (data extremes). A significanterror rate will neutralize the predictive capabilities of classical QSARtechnology.

[0026] To exemplify, consider the following simple example. Suppose thatactivity y is linearly related to a single descriptor x. The linearrelationship is expressed as follows:

y=mx+b

[0027] A conventional data set would consist of n observations(y_(i),x_(i)). Without loss of generality it may be assumed that theslope is greater than zero, m>0, the x_(i) have mean 0 and variance 1,and that activity is indicated by the condition that$\left. {{\begin{matrix}{y < 0} & \left( {i.e.\quad {when}} \right.\end{matrix}\quad x} < \frac{- b}{m}} \right).$

[0028] Using linear regression, the estimates for m and b are:$\begin{matrix}{\quad {{\hat{m} = {\frac{1}{n}{\sum\limits_{i = 1}^{n}\quad {y_{i}x_{i}}}}},}} & {{\hat{b} = \overset{\_}{y}},} & {\overset{\_}{y} = \frac{1}{n}}\end{matrix}{\sum\limits_{i = 1}^{n}\quad y_{i}}$

[0029] When presented with HTS binary measurements (i.e. 1 is active and0 is inactive) representing the condition that y_(i)<0 the linearregression estimates become: $\begin{matrix}{\hat{m} = {\frac{1}{n}{\sum\limits_{{xi} < {{- b}/m}}^{\quad}\quad x_{i}}}} & {\hat{b} = \frac{a}{n}}\end{matrix}$

[0030] where a is the number of active compounds. These estimates arecompletely different than those obtained from non-binary input (e.g.,the b estimate is always in the range [0,1] for binary data). Forexample, the estimated descriptor value at the boundary between activeand inactive is:$x = \frac{- 1}{\frac{\sum\limits_{{xi} < {{- b}/m}}^{\quad}\quad x_{i}}{a}}$

[0031] This is inversely proportional to the mean active descriptorvalue. Contrast the above equation, which was developed with linearregression, with −b/m, the true descriptor value at the boundary. Theassumptions of linear regression are not satisfied with binary HTS data.

OBJECTS AND SUMMARY OF THE INVENTION

[0032] It is an object of the present invention to provide a method fordeveloping a quantitative structure activity relationship that overcomesthe shortfalls of the prior art.

[0033] Another object of the present invention is to provide a methodfor developing a quantitative structure activity relationship thatallows the prediction of a candidate compound for a particular target tobe identified as either active or inactive.

[0034] A further object of the present invention is to provide a methodfor developing a quantitative structure activity relationship that isless sensitive to High Throughput Screening input data error andoutliers than the prior art.

[0035] Still a further object of the present invention is to provide amethod for developing a quantitative structure activity relationship andanalyze candidate compounds with the use of computer equipment.

[0036] Yet a further object of present invention is to provide a methodfor developing a quantitative structure activity relationship that isnot significantly influenced by data boundary effects.

[0037] Still a further object of the present invention is to predictwhether or not a chemical compound is a member of a particular set.

[0038] Yet another object of this present invention is to provide amethod for developing a quantitative structure activity relationshipthat includes obtaining a training set of chemical compounds withmolecular descriptors consisting of a number of multidimensional vectorswith an activity class for each of the vectors; partitioning themultidimensional vectors in groups having interdependence; transformingthe descriptors such that the interdependence of the groups is lessened;estimating a probability distribution of the descriptors by assumingthat the probability distribution of the product of each of the groupsis approximately equal to the probability distribution of the moleculardescriptors; performing the partitioning, transforming and estimatingsteps for each of the activity classes; and, developing a probabilitydistribution for the activity classes.

[0039] Still a further object of the present invention is to provide amethod for predicting activity of candidate ligands that includesdeveloping a prediction model; obtaining a candidate chemical compound;and, applying the prediction model to the candidate compound.

[0040] Yet another object of the present invention is to provide asystem for predicting activity of candidate compounds as either activeor inactive that includes an analyzer that receives a training set ofchemical compounds; a prediction model developed by the analyzer and isbased on the training set; and, a sorter that receives a candidateligand and receives the model from the analyzer, the sorter applies themodel to the candidate ligand to predict the activity of the candidateligand.

[0041] Still a further object of the present invention is to provide acomputer-based method of generating a quantitative structure activityrelationship that includes calculating a numerical representation ofmolecules consisting of n numbers per molecule; and, estimating aprobability distribution that a molecules is active.

BRIEF DESCRIPTION OF THE DRAWINGS

[0042]FIG. 1 is a flow diagram of the method of the present invention;

[0043]FIG. 2 is a flow diagram of the analyzer with its input andoutput;

[0044]FIG. 3 is a mathematical flow diagram of the analyzer with itsinput and output;

[0045]FIG. 4 is a mathematical flow diagram of the sorter with its inputand output;

[0046]FIG. 5 is a flow diagram of binary QSAR analysis in MOE; and,

[0047]FIG. 6 is a graph of accuracy versus active percentage compounds.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0048]FIG. 1 is a flow chart of the present invention, showing theoverall structure of the method of developing a discrete QuantitativeStructure Activity Relationship (QSAR), and applying the discrete QSARto candidate compounds to determine the probability that a particularcandidate will be active.

[0049] A training set of compounds 4 is obtained. Training set 4 areresults from High Throughput Screening (HTS) experiments. Training set 4may be data from other sources other than HTS, even virtual orhypothetical data.

[0050] Training set 4 comprises molecular descriptors to describe thechemical structures of the compounds, and the activity classes ordiscrete binding affinities associated with the descriptors.

[0051] Training set 4 is sent to an analyzer 8 to develop a model 12.Analyzer 8 is a computer. The functions of analyzer 8 may alternativelybe performed by some other means, even by hand calculations. Analyzer 8will be described further below.

[0052] Model 12 is a mathematical function that is the output ofAnalyzer 8. Model 12 is developed based on the 4 chemical structures ofthe training set 4 and the activity classes associated therewith, aswill be thoroughly discussed below.

[0053] Candidate compounds 16 and model 12 are sent to a sorter 20.Candidate compounds 16 are experimental data. However, candidatecompounds 16 may be from any source, even virtual or hypotheticalcompounds.

[0054] Sorter 20 applies model 12 to candidate compounds 16 to determinethe activity of each candidate compound 16 for a particular target.Sorter 20 is a computer. The functions of sorter 20 may be performed byother means, such as hand calculations. Sorter 20 will be describedfurther below.

[0055] Analyzer 8 and sorter 20 are connected together, to allow sorter20 to receive model 12, and to a display device, not shown. The displaydevice will allow a user to inspect the outputs of analyzer 8 and sorter20. The display device is preferably a computer monitor. However, thedisplay device may also be a printer, etc.

[0056] An example of a type of computer software program that willperform the methods described herein is entitled “Molecular OperatingEnvironment” available through a license from the Chemical ComputingGroup Inc. of Montreal, Quebec, Canada.

[0057]FIG. 2 displays the overall general process that analyzer 8performs, along with its input, training set 4, and its output, model12. The following is a description of the steps shown in FIG. 2; moremathematical detail is set forth below.

[0058] Training set 4 is characterized by a number of multidimensionalmolecular descriptor vectors with an activity class associated withevery vector.

[0059] The multidimensional descriptor vectors of training set 4 arepartitioned into groups, 32. This partition is arbitrary. The onlyrestriction is the fact that the higher the dimension, the more data isneeded and more computer memory is needed. The groups haveinterdependence.

[0060] As represented by 36, the molecular descriptors are transformedto lessen the interdependence of the groups. Transformation will bediscussed further below.

[0061] To estimate the distribution of the original moleculardescriptors, the product of the distributions of each of the groups isassumed to approximately equal the distribution of the originalmolecular descriptors, as represented by 40.

[0062] Steps 32, 36 and 40 need to be performed for each activity class,as represented by 44.

[0063] The distribution of the activity classes must also be estimated48.

[0064] With all of the distributions estimated, they are combined toestablish a prediction function, i.e. model 12, that will determinewhether a candidate compound belongs to a particular activity class.

[0065] The mathematical methods employed in the above steps will now beset forth for the particular case of a partition into groups of size 1and a single transformation applicable to all classes. FIG. 3 displaysthe mathematical flow diagram for analyzer 8 and FIG. 4 displays themathematical flow diagram for sorter 20.

[0066] Analyzer 8 accepts training set 4 data, which may becharacterized by {(y_(i),x_(i))}. y_(i) is represented by 52 and x_(i)is represented by 56. Training set 4 is the results of m HTS experimentson a common target. Thus, there are m (y_(i),x_(i))'s where the y_(i)are discrete values that without loss of generality, it may be assumedthey are numbers {1, 2, . . . ,k} and each x_(i) is a vector each with nnumbers (the molecular descriptors) and we write x_(i)=(x_(il), . . . ,x_(in)) represented by 60.

[0067] We will now introduce a random variable Y over the values {1,2, .. . ,k}, not shown in flow diagram, and a random variable over n-vectors(a random molecular descriptor), X=(X₁ . . . Xn), not shown in flowdiagram.

[0068] The conditional distribution Pr(Y|X) is used to determine theprobability that a new molecule L, not shown, belongs to activity classy with Pr(Y=y|X=L). The molecule can then be sorted into the class thathas the highest probability, mathematically represented using the Bayestheorem:${\Pr \left( {Y = {\left. y \middle| X \right. = x}} \right)} = \frac{{\Pr \left( {X = {\left. x \middle| Y \right. = y}} \right)}{\Pr \left( {Y = y} \right)}}{\sum\limits_{i = 1}^{k}{{\Pr \left( {X = {\left. x \middle| Y \right. = i}} \right)}{\Pr \left( {Y = i} \right)}}}$

[0069] To use this formula for practical purposes it is necessary toanalyze the HTS data in an effort to approximate the distributions onthe right hand side of the equation.

[0070] The prior distribution of Y is estimated using a maximumlikelihood estimator or a Bayes estimator. Any method of estimatingthese probabilities may be used. A Bayes estimator has been chosen,since it is well defined for all inputs, where Cj is the number of timesthat y_(i)=j in the HTS experimental data:${{\Pr \left( {Y = j} \right)} \approx {g(j)}} = \frac{C_{j} + 1}{m + k}$

[0071] Estimating the k distributions of the form Pr(X=x|Y=j) is moreproblematic since the X is a vector of n numbers: for values of n offive or more, a straightforward histogram build-up, or countingprocedure cannot be used in practice because there will not be enoughexperimental data to approximate the distribution with any reasonableaccuracy.

[0072] Our method to approximate the distributions of X is to transforma multidimensional distribution into a product f one dimensionaldistributions. However, it is understood herein, that rather thantransforming into one dimensional distributions, the multidimensionaldistribution may be transformed into simply a collection of lesserdimensional distributions. The idea is to partition the multidimensionaldistributions into smaller groups to reduce the dimensions to enableone, or a computer, to work with the data.

[0073] Thus, to decorrelate 64 each multidimensional vector x_(i), themethod of principal component analysis is used to determine a p by nlinear transform Q and a n-vector u, collectively 68, such that therandom variable Z=Q(X−u) has a covariance matrix equal to the p by pidentity matrix. For the purposes of approximation, it is assumed thatthe individual coordinates of Z are independent so that the followingapproximation can be made and is represented as 68 in FIG. 3:${{\Pr \left( {X = {{xY} = y}} \right)} \approx {\Pr \left( {Z = {{{Q\left( {x - u} \right)}Y} = y}} \right)}} = {\prod\limits_{i = 1}^{p}\quad {\Pr \left( {Z_{i} = {{z_{i}Y} = y}} \right)}}$

[0074] this task, let W be a random variable over the reals and let ƒ(w)be the probability density for W. The function, ƒ, can be estimated byaccumulating a histogram of the observed sample values on a set of Bbins(b0,b1], . . . ,(bB−1,bB] defined by B+1 numbers b,<bk+1,b₀ is minusinfinity and b_(B) is plus infinity. Any method of estimating continuousdistributions, other than the one explained here, may be employed.

[0075] The usual procedure for counting the number of observations amongm samples in bin k>0 is:$B_{k} = {{\sum\limits_{i = 1}^{m}\quad {\delta \left( {w_{i} \in \left( {b_{k - 1},b_{k}} \right\rbrack} \right)}} = {\sum\limits_{i = 1}^{m}{\int_{b_{k - 1}}^{b,}{{\delta \left( {x - w_{i}} \right)}\quad {x}}}}}$

[0076] This procedure has an unfortunate sensitivity to the selection ofbin boundaries since observations close to a bin boundary are treated asif they were in the middle of one of the bins. In view of thissensitivity, it is desirous to spread the observations out over thebins. In other words, rather than having a single observation point, theobservation will be blurred over several bins. Here, the blurring areais created by a bell-curve. However, any type of spreading or blurringmay be used.

[0077] Accordingly, to reduce the sensitivity to the bin boundaries, thedelta function in the above equation is replaced with a Gaussian, withvariance s₂. This can be thought as an observation error as well as asmoothing parameter. The equation now becomes:$B_{k} = {{\sum\limits_{i = 1}^{m}{\int_{b_{k - 1}}^{b_{k}}{\frac{1}{s\sqrt{2}}{\exp \left\lbrack {{- \frac{1}{2}}\frac{\left( {x - w_{i}} \right)^{2}}{s\sqrt{2}}} \right\rbrack}{x}}}} = {\frac{1}{2}{\sum\limits_{i = 1}^{m}\left\lbrack {{{erf}\left( \frac{b_{k} - w_{i}}{s\sqrt{2}} \right)} - \left( \frac{b_{k - 1} - w_{i}}{s\sqrt{2}} \right)} \right\rbrack}}}$

[0078] Using the above techniques an estimation 76, is made for pkdistributions from the HTS experimental data. In other words, weapproximate the one dimensional distributions with ƒ_(j)(z,y) for j in{1, . . . ,p} and y in {1, . . . ,k}. . . , ƒ_(j)(z,y), C_(y) isrepresented by 80 in FIG. 3. The final approximation, or model 12, withz=Q (x−u), is:${\Pr \left( {Y = {\left. y \middle| X \right. = x}} \right)} \approx \frac{\left( {C_{y} + 1} \right){\prod\limits_{j = 1}^{p}\quad {f_{j}\left( {z_{j},y} \right)}}}{\sum\limits_{i = 1}^{k}\quad {\left( {C_{i} + 1} \right){\prod\limits_{j = 1}^{p}\quad {f_{j}\left( {z_{j},i} \right)}}}}$

[0079] Now that model 12 is developed, predictions for a candidateligand c can be made in two ways. Depending on whether the activityclasses are an ordered scale, a user will choose the class that has themaximum probability, or use the expected class value for the prediction.

[0080]FIG. 4 displays the mathematical flow diagram for sorter 20. Theinput for sorter 20 is a candidate ligand c, 16, and model 12. Thetransform Q and n-vector u, 68, and ƒ_(j)(z,y), C_(y), 80, make up model12.

[0081] Candidate ligand 16 must go through the same process that eachx_(i) did as described above. These steps are represented as 84 and 88,which mimic 72 and 76 above. The output 24 of sorter 20 is the activityclass of candidate ligand 16.

[0082] The steps outlined above are to be performed with the use ofcomputer software and a computer. However, it is understood that thesteps may be performed by some other means, even by manual calculations.

[0083] Use of the present invention typically will be iterative withefforts directed at selecting, determining, discovering or inventingthose descriptors that lead to an accurate and predictive model ofbiological activity. A typical sequence of general QSAR steps, notnecessarily the inventive steps, are the following.

[0084] Obtain a collection of chemical structures and a collection ofactivity classes numbers {y₁} in the range 1, . . . ,k such that witheach chemical structure, there is an associated activity class number.

[0085] For each chemical structure, calculate a set of descriptorsx=(x₁, . . . ,x_(n)). The complete input data set will be the {(y_(i),x_(i))}.

[0086] Apply the procedure described herein and depicted in FIGS. 2 and3 using the input data set training set as the set of “candidates” toobtain a “model” consisting of (Q, u, f_(j),C_(y)) and a collection ofmodel predictions p_(i).

[0087] If the model predictions {p_(i)} are in substantial agreementwith the input activity classes {y_(i)} and the model is judged to besuitably “predictive”, then the model can be used to predict activityclass of new candidates. Otherwise, the model can be adjusted byreturning to the step of calculating a set of descriptors.

[0088] The model that is developed is then used to predict the activityclass of a (possibly novel) chemical structure by calculating the samedescriptors as were calculated in the step of calculating a set ofdescriptors for the model and by applying the model.

[0089] An objective of application of the present invention, as statedabove, is to build a model to predict the 0 or 1 class when presentedwith chemical structure.

[0090] The present invention requires a numerical description of boththe activity class and, for each activity class a vector of numbers (thedescriptors, or quantification of the chemical structure). The source ofthe initial data set is quite arbitrary so long as a set of descriptorscan be determined from the chemical structures. As mentioned above, thechemical structures need not refer to actual compounds; that is, theycan be virtual compounds or hypothesized compounds. The activity classescan be any arbitrary classification of the structure; in most cases,this activity class will be some quantification or classification ofbiological activity.

[0091] Because the source of the data set is arbitrary, it is impossibleto enumerate all possible ways in which a data set, or training set, canbe assembled.

[0092] Research into scientific literature regarding experiments withthe Carbonic Anhydrase II receptor revealed information describingphysical experiments to determine and quantify the binding affinity of avariety of chemical compounds. Each compound is given a numerical valueindicating its binding affinity. Table 1 depicts the nature of anexample initial data set. Beside each drawing in Table 1, is aquantitative experimental assessment of binding affinity.

[0093] The activity data can be converted into, for example, twoactivity classes, “active” and “inactive”, by comparison to a thresholdvalue. For example purposes, the threshold value is picked at 5.85. Ifthe activity value is less than 5.85, the activity class is 0.Otherwise, it would be 1. This results in the following data setdepicted in Table 2.

[0094] It is preferred that the preparation of the initial data setwould be performed by a computer. In such a case, the structures aredrawn with commercially available chemical drawing programs or chemicalinformation systems that return chemical structures. Such computerrepresentations of chemical structures typically encode the connectivityand element labels of chemical structures. Some representations encodethe depiction while some encode only the connectivity. For example, thesame data of Table 2 can be represented textually using SMILES strings(a character-based encoding of chemical structures).

[0095] The nature of encoding of the chemical structures is notcritical, as long as molecular descriptors can be calculated from thestructure encoding.

[0096] A molecular descriptor is a number calculated from a chemicalstructure. For example, if chemical structures are encoded usingchemical formulas, then the molecular weight of the structure is anexample of a molecular descriptor. The molecular weight is a number thatcan be calculated from the chemical formula. It is preferred that themolecular descriptors be calculated by means of a computer. However,they may also be derived by mental calculations from introspection andexamination of the data. Scientific literature contains many examples ofdescriptors used in QSAR studies. Examples of molecular descriptors are:molar refractivity; octinol/water partition coefficients; pKa; numbercarbons; number of triple bonds; number of aromatic atoms; sum of thepositive partial charges on each atom; water accessible surface areas;heat of formation; topological connectivity indices; topological shapeindices; electro topological state indices; structure fragment counts;van der Waals volume; etc. In general, the quality, accuracy andpredictiveness of the calculated model will depend on which descriptorsare chosen for a particular data set. Automatic and/or statisticalmethods are used to help select appropriate descriptors in the iterativemodel building procedure described herein.

[0097] As set forth above, the descriptors and activity classes are usedto estimate the model parameters. The model can be used to “predict” or“back test” the activity classes of the training set. The statisticalcross-validation procedures such, as “leave-one-out”, may be used toestimate the quality and predictiveness of the model.

[0098] When the results of the model building and descriptor selectionprocedure are judged suitably accurate, the iterative procedureterminates. Exact termination criteria cannot be specified sinceaccuracy and predictiveness will depend on the applications of themodel. For example, a relative high accuracy of the model will be neededif the model is to be used to search databases of available compounds inan effort to locate compound with activity class “1”. On the other hand,a less accurate model can be used if only a trend or gross indication ofactivity is required. In other words, the termination criteria areproblem dependent.

[0099] To use the calculated model to predict activity classes ofchemical structures not presented in the initial data training set, thefollowing is performed. Each new structure is prepared in the samemanner as the initial data set. The same molecular descriptors arecalculated for the new structure and the vector of descriptors is usedas input to the calculated model. The sorter, which utilizes the model,will output a predicted activity class. Typical uses of such a modelwould be compound data base searching, focusing on combinatoriallibraries, or de novo design (the attempt to create new molecules bymodification of chemical structures).

[0100] The following is based on and is a partial reproduction of:“Binary Quantitative Structure-Activity Relationship (QSAR) Analysis ofEstrogen Receptor Ligands”, Gao, H., Williams C., Labute P., Bajorath,J. J. Chem. Inf. Comput. Sci 1999, 39, 164-168, which is incorporatedherein by reference.

[0101] The above methods for discrete or binary QSAR correlate compoundstructures, using molecular descriptors, with a “binary” expression ofactivity, i.e., 1=active and 0=inactive, and calculates a probabilitydistribution for active and inactive compounds in a training set. Thisfunction can then be used to predict active compounds for a given targetin a test set. The present invention is applied below to a drugdiscovery problem, the analysis of estrogen receptor ligands.

[0102] The estrogen receptor is an extensively studied pharmaceuticaltarget for which a large number of ligand analogs have been generatedand characterized. In addition, structural studies have elucidated themechanism of the estrogen receptor-ligand interaction and identified thebinding determinants. The estrogen receptor binding affinity data ofestrogen analogs have been transformed into a binary data format. Apredictive binary QSAR model has been derived and this model has beenapplied to a test set of other estrogen analogs. Both active andinactive analogs were predicted with high accuracy. The binary QSARmodel was stable for a variety of binary activity cutoff values and themodel was quite insensitive to boundary effects.

[0103] The binary QSAR analysis procedure used in this study aregenerally depicted in FIG. 5. Binary QSAR estimates, from a trainingset, the probability density Pr(Y=1X=x) where Y is a Bernoulli randomvariable (i.e. Y takes on values of 0 or 1) representing “active” or“inactive” and X is a random n-vector or real numbers (a randomcollection of molecular descriptors).

[0104] A Principle Components Analysis (PCA) is conducted on thetraining set to calculate an n by p linear transform, Q, and ann-vector, u, such that the random p-vector Z=O(X−u) has mean andvariance equal to the p by p identity matrix. The quantity p is referredto as the number of principle components.

[0105] The original molecular descriptors are transformed by Q and u toobtain a decorrelated and normalized set of descriptors. The desiredprobability density is then approximated by applying Bayes' theorem andassuming that the transformed descriptors are mutually independent:$\begin{matrix}{{\Pr \left( {Y = {\left. 1 \middle| X \right. = x}} \right)} \approx \left\lbrack {1 + {\frac{{pr}\left( {Y = 0} \right)}{\Pr \left( {Y = 1} \right)}{\prod\limits_{i = 1}^{p}\frac{\Pr \left( {Z_{i} = {\left. z_{i} \middle| Y \right. = 0}} \right)}{\Pr \left( {Z_{i} = {\left. z_{i} \middle| y \right. = 1}} \right)}}}} \right\rbrack^{- 1}} \\{\quad {Z = {{Q\left( {X - u} \right)} = \left( {Z_{1},\ldots \quad,Z_{p}} \right)}}}\end{matrix}$

 Z=Q(X−u)=(Z ₁ , . . . ,Z _(p))

[0106] Each probability density Pr(Z_(i)=z_(i)) is estimated byconstructing a histogram. Conventional procedures for histogramconstruction are sensitive to bin boundaries since every observation, nomatter how close to a bin boundary, is treated as though it falls in thecenter of the bin. To reduce this sensitivity, each observation isreplaced with a Gaussian density with variance σ². This variance can beinterpreted as an observation error or as a smoothing parameter.

[0107] Once all of the 2p+2 probability densities have been estimatedfrom the training set, the desired Pr(Y=1/X=x) is constructed using theabove formula.

[0108] The binding data of estrogen analogs to estrogen receptors ofdifferent species was collected from literature. There is little, ifany, evidence for receptor-species difference in estrogen analogstructure-affinity relationships. There are two subtypes of estrogenreceptors, ER-α and ER-β. The data reported here is presumed to comefrom ER-α, since this subtype is the predominant one in uterine andbreast tissue. The binding data was placed on a common “relative bindingaffinity” (RBA) scale. Values on this scale were calculated as apercentage of the ratio or IC₅₀ values of test compounds to displace 50%of [³H]estradiol from estrogen receptor binding. Thus, on the RBA scale,estradiol has a value of 100, with lower affinity analogs having lowervalues and higher affinity analogs higher RBA values. A total of 463compounds were selected (tested for binding at 0 to 4° C.), 410 of whichwere used as a training set to derive a binary QSAR model, and 53compounds as a test set to evaluate the model by predicting active andinactive compounds. Table 4 shows the composition of estrogen analogsused in this analysis. The continuous biological activity data wasexpressed in binary form using a threshold criterion (log RBA). Anycompounds with log RBA larger than or equal to this criterion wereclassified as active, and any compounds with lower log RBA values wereclassified as inactive. Different activity threshold values were used toalter the percentage of active compounds in the training set.

[0109] Molecular descriptors were calculated using 1998.03 version ofMOE, from the Chemical Computing Group Inc. of Montreal, Quebec, Canada,and binary QSAR analysis was carried out with the MOE binary QSARfunction.

[0110] Performance of a binary QSAR model was measured as follows: letm₀ represent the number of active compounds, m₁ the number of inactivecompounds, c₀ the number of active compounds correctly labeled by theQSAR model, c₁ the number of inactive compounds correctly labeled by theQSAR model. Three parameters of performance were calculated: 1. accuracyon active compounds, c₀/m₀; 2. accuracy on inactive compounds, c₁/m₁; 3.overall accuracy on all of the compounds, (c₀+c₁)(m₀+m₁). The derivedbinary QSAR model was cross-validated by a leave-one-out procedure.

[0111] In this procedure, only one object is eliminated at a time andthe process is repeated until all objects have been eliminated once andonly once. Accuracy was calculated for each step, and an averageaccuracy for all the steps was reported as a measure of the internalpredictivity of the model within the training set.

[0112] A set of 410 compounds was chosen to be a training set to derivethe binary QSAR model. The range of the biological activities (log RBA)was −2.02 to 2.60. Table 5 shows the data profiles with differentthreshold values.

[0113] A value of 1.7 of log RBA which corresponds to 50% of RBA wasselected as the threshold to derive the binary QSAR model. Based on thisthreshold criterion, 62 compounds were active and 348 compounds wereinactive in the training set. A smoothing factor was introduced tominimize the sensitivity of the derived model to the selection of binboundaries as mentioned earlier. The binary QSAR model is alsoinfluenced by the number of principle components used. A 5×7 factoranalysis was carried out to determine the effects of different smoothingfactor values and principle component numbers on the binary QSARanalysis of the data set analyzed. Table 6 summarizes the results of theanalysis.

[0114] In this study, two-dimensional (2D) molecular descriptors wereused and were shown to perform well in compound clustering. In addition,Keir's shape indices were used, which contain implicit three-dimensional(3D) information. Explicit 3D descriptors were not considered to avoidbias of the analysis due to predicted conformational effects. Thedifferent combination of molecular descriptors have been systematicallyexplored to identify a set that captures structural characteristics ofestrogen analogs and resulting activities well. This was done for thelearning set similar to more conventional QSAR analysis.

[0115] Table 6 shows that an optimal binary QSAR model was obtained by acombination of principal component numbers of 12 and a smoothing factorvalue of 0.12. Using this combination, the non-cross-validated accuracyis 85% on active compounds, 93% on inactive compounds, 92% for all thecompounds. The cross-validated accuracy is 76% on active compounds, 93%on inactive compounds, and 90% for all the compounds. Any departure fromthese parameter values decreased the non-cross-validated and/orcross-validated accuracy. Thirteen molecular descriptors were used toderive the binary QSAR model (Table 7), including four atomicconnectivity indices, four molecular shape indices, one totalhydrophobic accessible surface area descriptor, one charge descriptor,one aromatic bond descriptor, and two indicator variables for specificfunctional group and molecular structure. One of the descriptor used isI,es. A number of desthylstilbestrol (DES) analogs are found to be morepotent estrogen receptor ligands than estradiol itself, despite theirstructure similarity (log RBA is 2.48 for DES versus 2.00 forestradiol). Because structure features that account for higher potencyof DES analogs were not obvious, the indicator variable I,es wasincluded to account for this effect. A phenolic OH group that resemblesthe 3-OH of estradiol molecule is required for tight binding to estrogenreceptor. To account for this specific structural effect, an indicatorvariable, I, OH, was used.

[0116] The effects of ten different threshold values (log RBA valuesranges from −2 to 2) on the binary QSAR model were analyzed (FIG. 2).Accuracy on active compounds ranged from 70% to 98%, with the highestaccuracy obtained for 98% active compounds and the lowest for 7% activecompounds. The overall accuracy remains stable at different thresholdvalues (around 90%). FIG. 6 shows that selected threshold values causefluctuation of observed overall accuracy by approximately 10%. Theminimum obtained overall accuracy is about 80%. Thus, on the basis ofthese findings, the overall binary QSAR accuracy remains stableirrespective of the chosen threshold values.

[0117] Compounds with biological activity near the binary thresholdvalue may fall into either the active or inactive category, which alsodepends on the experimental error. To analyze the influence of boundaryeffects on the binary QSAR model, compounds with log RBA values between1.0 and 1.7 were omitted. Therefore, in these calculations, binaryclassification corresponds to largest difference in biologicalactivities. This data set consisted of 292 inactive and 62 active(17.5%) compounds. In the resulting QSAR model, an accuracy 87% onactive, 95% on inactive, and 93% for all 354 compounds was achieved. Theperformance is only slightly better than that obtained for the originaltraining set. These results indicate that the boundary effects testedhave only marginal influence on the binary QSAR accuracy, indicatingthat the binary QSAR model is stable. The obtained accuracy is notcritically dependent on binary classification of observed activities,which is important with respect to the analysis of screening data.

[0118] In order to evaluate the predictive value of the binary QSARmodel, 53 randomly selected estrogen analogs were tested. Seven out of 9active compounds (78%), and 43 out 44 inactive compounds (98%) werecorrectly predicted (overall accuracy of 94%), which is consistent withthe cross-validation result. The percentage of active compounds in thetest set was 15%. If the compounds were selected and tested based on thebinary QSAR model, the “hit rate” of active compounds would be 5 foldhigher than randomly selected compounds even for this small data set.

[0119] The X-ray structures of the ligand binding domain of ER-αreceptor in complex with estradiol and raloxifene have been reported inthe past. The ligands are buried within the hydrophobic core of theligand binding domain, but the polar ends of estradiol form hydrogenbonds to the only polar amino acid residues in the binding site. Glu353forms a hydrogen bond to the A-ring phenolic hydroxyl group and His524forms a hydrogen bond with 17β-hydroxyl group. The phenolic hydroxylgroup is required for binding. The 3-OH group on estradiol can act as ahydrogen bond donor or acceptor, but the hydrogen bond donor ability ismore important than the acceptor ability in stabilizing the complex.3-Keto and 3-methyl ether derivatives have much lower binding affinitiesbecause they lack a hydrogen bond donor. The aromatic ring system isrequired for strong binding because analogs lacking aromatic moietieshave only low binding affinity. It follows that structural differencesbetween active and inactive compounds are distinct but may be quitelimited. The estrogen analogs are considered to be a challenging testcase for binary QSAR analysis because of small structural modifications,which actually change binary activity in a more continuous way, areconsidered here to render compounds either active or inactive.

[0120] Estrogen analogs have also been studied by conventional orclassical QSAR techniques. Earlier QSAR studies on estrogen analogs didnot reveal a consistent positive hydrophobic contribution forreceptor-ligand binding, except substituents at the 11-β position ofestradiol derivatives, although hydrophobicity expressed as log P(o/w)differs significantly among the analogs. Similarly, in the binary QSARmodel, log P(o/w) was not found to be a significant descriptor. Incontrast, ASA-H (which does not strictly correlate with log P(o/w)(r²=0.62)) was found to be significant. This finding suggests that thestrength of van der Waals/hydrophobic interactions between ligands andreceptor is more important than the differences in energy required todesolvate the hydrophobic ligands.

[0121] Conventional QSAR based on regression techniques, such asmultiple linear regression, partial least squares and, occasionally,neural networks, have been used to cluster compounds. These methods seekto minimize the squared error between the model and the observed data.This optimization of the model parameters introduces sensitivity toerrors in experiments and regression analyses. In contrast, binary QSARdoes not use any form of regression analysis; there is no attempt tominimize the model errors with regard to model parameters. It is anonlinear modeling method. Because no regression is used, the modelestimation procedure is very fast, which is in contrast to neuralnetworks that require a lengthy training phase. Therefore, binary QSARcan efficiently process large data sets such as HTS data.

[0122] Several other clustering methods have been tested to classifycompounds into different clusters. These methods are qualitative in thatthey are based on only chemical structural information regardless ofbiological activities. Compounds with similar structural features areclustered together. However, compounds with similar biologicalactivities may appear in different clusters depending on their degree ofstructural similarity. In this case, identification of active clustersmay be a nontrivial task. In contrast, binary QSAR takes both structureand activity information into account, and deduces a probabilitydistribution function for novel compound to be either active orinactive.

[0123] While this invention has been described as having a preferreddesign, it is understood that it is capable of further modification,uses and/or adaptions following in general the principle of theinvention and including such departures from the present disclosure ascome within known or customary practice in the art to which theinvention pertains, and as may be applied to the essential features setforth, and fall within the scope of the invention or the limits of theappended claims. TABLE 1 STRUCTURE ACTIVITY

5.80

5.92 . . . . . .

6.40

[0124] TABLE 2 STRUCTURE CLASS

0

1 . . . . . .

1

[0125] TABLE 3 STRUCTURE (SMILES) CLASS CC(C)NS(O) (O)C1 = NN − C 0(NC(C) = OS1 CNS(O) (O)clcccc([C1])C1 1 . . . . . . NS(O) (O)clccc(N = C− 1 c2c(O)cccc1)cc1

[0126] TABLE 4 Composition of Estrogen Receptor Ligands number ofCategory representative structure compounds Estradiol derivatives

165 3-keto steroids

 2 nonaromatic analogs

 4 metahexstrol derivatives

 15 hexestrol derivatives

 50 diethylstilbestrol derivatives

 10 tryphenylethylene analogs

 40 2-phenylbezothiopene analogs

 68 2-phenylindole analogs

 61 indene analogs

 45 Phenol and biphenols

 3

[0127] TABLE 5 Data Profiles at Different Binary Threshold Valuesthreshold value active inactive (log RBA) compounds compounds active %−2.0 404  6 98% −1.5 394  16 96% −1.0 382  28 93% 0.0 307 103 75% 1.0177 233 43% 1.2 146 264 36% 1.5  92 318 22% 1.7  62 348 15% 1.8  53 35713% 2.0  27 383  7%

[0128] TABLE 6 Effects of PCA Number and Smoothing Factor on Binary QSARPCA smoothing factor no. 0.08 0.10 0.12 0.14 0.16 0.20 0.25 6 0.79 0.760.74 0.69 0.69 0.60 0.52 0.63 0.61 0.60 0.60 0.55 0.48 0.45 8 0.81 0.770.77 0.76 0.76 0.73 0.66 0.71 0.71 0.71 0.69 0.68 0.65 0.55 10 0.85 0.840.84 0.84 0.81 0.77 0.73 0.68 0.68 0.66 0.68 0.69 0.68 0.65 12 0.85 0.850.85 0.82 0.81 0.81 0.79 0.69 0.71 0.76 0.76 0.73 0.73 0.71 13 0.85 0.850.82 0.82 0.82 0.82 0.79 0.63 0.66 0.69 0.69 0.69 0.66 0.68

[0129] TABLE 7 Molecular Descriptors Used in the Binary QSAR symboldescription b-ar number of aromatic bonds ASA-H total hydrophobicaccessible surface area ⁰X zero-order atomic connectivity index ⁰X^(v)zero-order atomic valence connectivity index ¹X first-order atomicconnectivity index ¹X^(v) first-order atomic valence connectivity index¹K Keir first shape index ²K Keir second shape index ³K Keir third shapeindex Φ Keir molecular flexibility index Peoe-PC⁺ total of positivecharge in Gasteiger & Marsili charge model I,OH indicator variable forphenolic hydroxy group; I,OH = 1 for compounds containing phenolic OHand 0 for other compounds I,es indicator variable for hexestrolderivatives; I,es = 1 for hexestrol compounds and 0 for other compounds.

What is claimed is:
 1. A computer-based method of generating aquantitative structure activity relationship comprising: a) calculatinga numerical representation of molecules consisting of n numbers permolecule; and, b) estimating a probability distribution that a saidmolecules is active.
 2. A method as recited in claim 1, wherein: a) saidestimating step is calculated with Bayes Theorem.
 3. A method as recitedin claim 1, wherein: a) said probability distribution of said estimatingstep comprises n one-dimensional distributions.
 4. A method as recitedin claim 1, wherein: a) said estimating step is performed by using ameans to remove linear correlations between said n numbers per molecule.5. A method as recited in claim 4, wherein: a) said means to removelinear correlations between said n numbers per molecule is a principalcomponents analysis.
 6. A method as recited in claim 4, wherein: a) saidmeans to remove linear correlations between said n numbers per moleculeis a matrix diagonalization.
 7. A method as recited in claim 1, wherein:a) said estimating step is performed by using a means to removedependencies between said n numbers per molecule.
 8. A method as recitedin claim 7, wherein: a) said means to remove dependencies between said nnumbers per molecule is a principal components analysis.
 9. A method asrecited in claim 7, wherein: a) said means to remove dependenciesbetween said n numbers per molecule is a matrix diagonalization.
 10. Amethod as recited in claim 1, wherein: a) said estimating step isperformed by estimating a distribution over a single number.
 11. Amethod as recited in claim 1, wherein: a) said estimating step isperformed by replacing a single observation with a Gaussiandistribution.